Native Language Identification: a Simple n-gram Based Approach
Abstract
This paper describes our approaches to Native Language Identification (NLI) for the NLI Shared Task 2013. NLI, a subarea of author profiling, focuses on identifying the first language of an author given a text written in their second language. Researchers have reported several feature sets that achieve relatively good performance on this task. The types of features used in such work include lexical, syntactic and stylistic features, dependency parsers, psycholinguistic features and grammatical errors. In our approaches, we selected lexical and syntactic features based on n-grams of characters, words, and Penn TreeBank (PTB) and Universal part-of-speech (POS) tags, together with perplexity values of character n-grams, to build four different models. We then combine the four models using an ensemble-based approach to obtain the final result. We evaluated our approach on a set of 11 native languages, reaching 75% accuracy.
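To make the ingredients named in the abstract concrete, the following is a minimal sketch of character n-gram extraction, a character n-gram perplexity score, and a simple way to combine several models' predictions. It is not the authors' implementation: the abstract does not specify the n-gram order, the smoothing method, or the ensemble rule, so the bigram order, add-one smoothing, and majority voting used here are assumptions, and the names `char_ngrams`, `CharBigramModel`, and `majority_vote` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the list of overlapping character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharBigramModel:
    """Character bigram model with add-one smoothing.

    Assumption: the abstract does not say which n-gram order or
    smoothing method was used; bigrams and add-one are illustrative.
    """

    def __init__(self, texts):
        self.bigrams = Counter()   # counts of each two-character sequence
        self.contexts = Counter()  # counts of each bigram's first character
        self.vocab = set()         # characters seen in training
        for text in texts:
            self.vocab.update(text)
            for gram in char_ngrams(text, 2):
                self.bigrams[gram] += 1
                self.contexts[gram[0]] += 1

    def perplexity(self, text):
        """Per-bigram perplexity of text under this model (lower is a better fit)."""
        grams = char_ngrams(text, 2)
        if not grams:
            return float("inf")
        size = len(self.vocab) + 1  # one extra slot for unseen characters
        log_prob = sum(
            math.log((self.bigrams[g] + 1) / (self.contexts[g[0]] + size))
            for g in grams
        )
        return math.exp(-log_prob / len(grams))


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first.

    Assumption: the abstract says only "ensemble based approach";
    majority voting is one simple option, not necessarily the one used.
    """
    return Counter(predictions).most_common(1)[0][0]
```

One way to use such a sketch would be to train one `CharBigramModel` per native language on that language's training essays, score a test essay under each model, and pass the resulting perplexities (or the label with the lowest perplexity) to a classifier or to `majority_vote` alongside the predictions of the other models.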
Similar resources
Native Language Identification: A Key N-gram Category Approach
This study explores the efficacy of an approach to native language identification that utilizes grammatical, rhetorical, semantic, syntactic, and cohesive function categories comprised of key n-grams. The study found that a model based on these categories of key n-grams was able to successfully predict the L1 of essays written in English by L2 learners from 11 different L1 backgrounds with an a...
BMSCE_ISE@INLI-FIRE-2017: A simple n-gram based approach for Native Language Identification
Native Language Identification (NLI) aims to identify the native language (L1) of an author by analysing text written by the author in another language (L2). NLI is often implemented as a supervised classification problem. In this paper, we report an NLI system implemented using character trigrams, word unigrams and word bigrams with a linear classifier, the Support Vector Machine (SVM). The work demons...
Simple Yet Powerful Native Language Identification on TOEFL11
Native language identification (NLI) is the task of determining the native language of an author based on an essay written in a second language. NLI is often treated as a classification problem. In this paper, we use the TOEFL11 data set, which contains more data, in terms of the number of essays and languages, and is less biased across prompts, i.e., topics, of essays. We demonstrate that even u...
Exploring Adaptor Grammars for Native Language Identification
The task of inferring the native language of an author based on texts written in a second language has generally been tackled as a classification problem, typically using as features a mix of n-grams over characters and part of speech tags (for small and fixed n) and unigram function words. To capture arbitrarily long n-grams that syntax-based approaches have suggested are useful, adaptor gramm...
Native Language Identification with PPM
This paper reports on our work in the NLI Shared Task 2013 on Native Language Identification. The task is to automatically detect the native language of the authors of TOEFL essays in a given set of English test documents. The task was solved by a system that used the PPM compression algorithm based on an n-gram statistical model. We submitted four runs; word-based PPMC algorithm with normaliza...